Abstract

A  lot  of  effort  has  been  made  in  computational  auditory  scene  analysis  (CASA)  to  segregate  speech 
from monaural mixtures. The performance of current CASA systems on voiced speech segregation is
limited by the lack of a robust algorithm for pitch estimation. We propose a tandem algorithm that performs
pitch  estimation  of  a  target  utterance  and  segregation  of  voiced  portions  of  target  speech  jointly  and 
iteratively.  This  algorithm  first  obtains  a  rough  estimate  of  target  pitch,  and  then  uses  this  estimate  to 
segregate  target  speech  using  harmonicity  and  temporal  continuity.  It  then  improves  both  pitch  estimation 
and  voiced  speech  segregation  iteratively.  Systematic  evaluation  shows  that  the  tandem  algorithm  extracts 
a  majority  of  target  speech  without  including  much  interference,  and  it  performs  substantially  better  than 
previous  systems  for  either  pitch  extraction  or  voiced  speech  segregation. 
I. Introduction


Speech  segregation,  or  the  cocktail  party  problem,  is  a  well-known  challenge  with  important 
applications.  For  example,  automatic  speech  recognition  (ASR)  systems  perform  substantially  worse  in 
the  presence  of  interfering  sounds  [25]  [33]  and  could  greatly  benefit  from  an  effective  speech 
segregation  system.  Background  noise  also  presents  a  major  difficulty  to  hearing  aid  wearers,  and  noise 
reduction  is  considered  a  great  challenge  for  hearing  aid  design  [12].  Many  methods  have  been  proposed 
in  monaural  speech  enhancement  [26].  These  methods  usually  assume  certain  statistical  properties  of 
interference  and  tend  to  lack  the  capacity  to  deal  with  a  variety  of  interference.  While  voice  separation  has 
proven to be difficult, the human auditory system is remarkably adept at this task. This perceptual process
is known as auditory scene analysis (ASA) [6]. Psychoacoustic research in ASA has inspired
considerable  work  in  developing  computational  auditory  scene  analysis  (CASA)  systems  for  speech 
segregation  (see  [36]  for  a  comprehensive  review). 

Natural speech contains both voiced and unvoiced portions, and voiced portions account for about
75-80% of spoken English [19]. Voiced speech is characterized by periodicity (or harmonicity), which has
been  used  as  a  primary  cue  in  many  CASA  systems  for  segregating  voiced  speech  (e.g.  [8]  [16]).  Despite 
considerable  advances  in  voiced  speech  separation,  the  performance  of  current  CASA  systems  is  still 
limited  by  pitch  (F0)  estimation  errors  and  residual  noise.  Various  methods  for  robust  pitch  estimation 
have been proposed [31] [37] [11] [22]; however, robust pitch estimation under low signal-to-noise
ratio (SNR) conditions still poses a significant challenge. Since the difficulty of robust pitch estimation stems
from  noise  interference,  it  is  desirable  to  remove  or  attenuate  interference  before  pitch  estimation.  On  the 
other  hand,  noise  removal  depends  on  accurate  pitch  estimation.  As  a  result,  pitch  estimation  and  voice 
separation  become  a  “chicken  and  egg”  problem  [11]. 

We believe that a key to resolving the above dilemma is the observation that one does not need the entire
target signal to estimate pitch (a few harmonics can be adequate), and that without perfect pitch one can still
segregate some of the target signal. Thus, we suggest a strategy that estimates target pitch and segregates the
target in tandem. The idea is that we first obtain a rough estimate of target pitch, and then use this
estimate to segregate the target speech. With the segregated target, we should be able to generate a better pitch
estimate and use it for better segregation, and so on. In other words, we propose an algorithm that achieves
pitch estimation and speech segregation jointly and iteratively. We call this method a tandem algorithm
because  it  alternates  between  pitch  estimation  and  speech  segregation.  This  idea  was  present  in  a 
rudimentary  form  in  our  previous  system  for  voiced  speech  segregation  [16]  which  contains  two 
iterations. 

The  separation  part  of  our  tandem  system  aims  to  identify  the  ideal  binary  mask  (IBM).  With  a  time- 
frequency  (T-F)  representation,  the  IBM  is  a  binary  matrix  along  time  and  frequency  where  1  indicates 
that  the  target  is  stronger  than  interference  in  the  corresponding  T-F  unit  and  0  otherwise  (see  Fig.  5  later 
for an illustration). To simplify notation, we refer to T-F units labeled 1 and those labeled 0 as active
and  inactive  units,  respectively.  We  have  suggested  that  the  IBM  is  a  reasonable  goal  for  CASA  [16]  [34], 
and  it  has  been  used  as  a  measure  of  ceiling  performance  for  speech  separation  [24]  [29]  [30].  Recent 
psychoacoustic  studies  provide  strong  evidence  that  the  IBM  leads  to  large  improvements  of  human 
speech  intelligibility  in  noise  [9]  [23]. 

This  paper  is  organized  as  follows.  Sect.  II  describes  T-F  decomposition  of  the  input  and  feature 
extraction.  The  tandem  algorithm  has  two  key  steps:  estimating  the  IBM  given  an  estimate  of  target  pitch 
and estimating the target pitch given an estimated IBM. We describe these two steps in Sects. III and IV.
The  tandem  algorithm  is  then  presented  in  Sect.  V.  Systematic  evaluation  of  this  algorithm  on  pitch 
estimation  and  speech  segregation  is  given  in  Sect.  VI,  followed  by  concluding  remarks  in  Sect.  VII. 

II.  T-F  Decomposition  and  Feature  Extraction 

We  first  decompose  an  input  signal  in  the  frequency  domain  with  a  bank  of  128  gammatone  filters 
[28],  with  their  center  frequencies  equally  distributed  on  the  equivalent  rectangular  bandwidth  rate  scale 
from  50  Hz  to  8000  Hz  (see  [36]  for  details).  In  each  filter  channel,  the  output  is  divided  into  20-ms  time 
frames with 10-ms overlap between consecutive frames. The resulting T-F representation is known as a
cochleagram [36]. At each frame of each channel, we compute a correlogram, a running autocorrelation
function (ACF) of the signal, within a certain range of time delays. Each ACF represents the periodicity
of the filter response in the corresponding T-F unit. Let $u_{cm}$ denote a T-F unit for channel $c$ and frame $m$
and $x(c,t)$ the filter response for channel $c$ at time $t$. The corresponding ACF of the filter response is given
by


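$$A(c,m,\tau) = \frac{\sum_n x(c, mT_m - nT_n)\, x(c, mT_m - nT_n - \tau T_n)}{\sqrt{\sum_n x^2(c, mT_m - nT_n)\, \sum_n x^2(c, mT_m - nT_n - \tau T_n)}} \qquad (1)$$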


Here, $\tau$ is the delay and $n$ denotes discrete time. $T_m = 10$ ms is the frame shift and $T_n$ is the sampling time.
The above summation is over 20 ms, the length of a time frame. The periodicity of the filter response is
indicated by the peaks in the ACF, and the corresponding delays indicate the periods. We calculate the
ACF within the following range: $\tau T_n \in [0, 15\ \text{ms}]$, which includes the plausible pitch frequency range
from 70 Hz to 400 Hz [27].
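As an illustration of the ACF computation in Eq. (1), the following minimal Python sketch evaluates a running autocorrelation for one T-F unit. It assumes an energy-normalized form (values between -1 and 1); the function name `unit_acf`, the list-based signal, the 16-kHz sampling rate and the sample-index conventions are assumptions made for this example, not the authors' implementation.

```python
import math

def unit_acf(x, frame_end, frame_len, max_lag):
    """Normalized autocorrelation of one channel response for one frame.

    x: filter response of one channel (list of floats).
    frame_end: index of the newest sample of the frame (m*T_m in samples).
    frame_len: number of samples summed over (20 ms worth).
    max_lag: largest delay in samples (15 ms worth).
    Assumes frame_end - frame_len - max_lag + 1 >= 0.
    """
    acf = []
    for lag in range(max_lag + 1):
        num = e0 = e1 = 0.0
        for n in range(frame_len):
            a = x[frame_end - n]
            b = x[frame_end - n - lag]
            num += a * b
            e0 += a * a
            e1 += b * b
        denom = math.sqrt(e0 * e1)
        acf.append(num / denom if denom > 0 else 0.0)
    return acf

# Usage: a 100 Hz sinusoid sampled at 16 kHz has a period of 160 samples.
fs = 16000
x = [math.sin(2 * math.pi * 100 * t / fs) for t in range(1000)]
acf = unit_acf(x, frame_end=999, frame_len=320, max_lag=240)
# Search only lags between 2.5 ms (400 Hz) and 15 ms.
best_lag = max(range(40, 241), key=lambda k: acf[k])
```

The largest ACF value in the searched lag range falls at the period of the sinusoid, which is the property the pitch-based labeling relies on.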

It has been shown that cross-channel correlation, which measures the similarity between the responses
of two adjacent filters, indicates whether the filters are responding to the same sound component [8] [35].
Hence, we calculate the cross-channel correlation of $u_{cm}$ with $u_{c+1,m}$ by

$$C(c,m) = \frac{\sum_\tau [A(c,m,\tau) - \bar{A}(c,m)]\,[A(c+1,m,\tau) - \bar{A}(c+1,m)]}{\sqrt{\sum_\tau [A(c,m,\tau) - \bar{A}(c,m)]^2\, \sum_\tau [A(c+1,m,\tau) - \bar{A}(c+1,m)]^2}} \qquad (2)$$

where $\bar{A}$ denotes the average of $A$.
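The cross-channel correlation of Eq. (2) is a correlation coefficient between the ACFs of two adjacent channels, taken across lags. A minimal Python sketch follows; the function name and the synthetic cosine-shaped ACFs are assumptions chosen only to make the example self-contained.

```python
import math

def cross_channel_correlation(acf_a, acf_b):
    """Correlation coefficient between two ACFs defined over the same lags."""
    mean_a = sum(acf_a) / len(acf_a)
    mean_b = sum(acf_b) / len(acf_b)
    num = sum((a - mean_a) * (b - mean_b) for a, b in zip(acf_a, acf_b))
    var_a = sum((a - mean_a) ** 2 for a in acf_a)
    var_b = sum((b - mean_b) ** 2 for b in acf_b)
    denom = math.sqrt(var_a * var_b)
    return num / denom if denom > 0 else 0.0

# Two ACFs with the same periodicity correlate strongly; different ones do not.
lags = range(241)
same_1 = [math.cos(2 * math.pi * k / 160) for k in lags]
same_2 = [0.8 * math.cos(2 * math.pi * k / 160) for k in lags]
other = [math.cos(2 * math.pi * k / 97) for k in lags]
c_same = cross_channel_correlation(same_1, same_2)
c_other = cross_channel_correlation(same_1, other)
```

Because the means are removed and the result is normalized, a scaled copy of the same ACF yields a correlation of 1, while ACFs with different periodicities yield a clearly lower value.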

When  the  input  contains  a  periodic  signal,  high-frequency  filters  respond  to  multiple  harmonics  of  the 
signal  and  these  harmonics  are  called  unresolved.  Unresolved  harmonics  trigger  filter  responses  that  are 
amplitude-modulated, and the response envelope fluctuates at the F0 of the signal [14]. Here we extract
envelope fluctuations corresponding to target pitch by half-wave rectification and bandpass filtering, where
the passband corresponds to the plausible F0 range of target speech. Then we compute the envelope ACF,
$A_E(c,m,\tau)$, and the cross-channel correlation of response envelopes, $C_E(c,m)$, similar to Eqs. (1) and
(2).

III.  IBM  Estimation  Given  Target  Pitch 

A.  Unit  Labeling  with  Information  within  Individual  T-F  Units 

We  first  consider  a  simple  approach:  a  T-F  unit  is  labeled  1  if  and  only  if  the  corresponding  response 
or  response  envelope  has  a  periodicity  similar  to  that  of  the  target.  As  discussed  in  Sect.  II,  the  periodicity 
of a filter response is indicated by the peaks in the corresponding ACF. Let $\tau_S(m)$ be the estimated pitch
period at frame $m$. When a response has a period close to $\tau_S(m)$, the corresponding ACF will have a peak
close to $\tau_S(m)$. Previous work [16] has shown that $A(c,m,\tau_S(m))$ is a good measure of the similarity
between the response period in $u_{cm}$ and the estimated pitch.

Alternatively,  one  may  compare  the  instantaneous  frequency  of  the  filter  response  with  the  estimated 
pitch  directly.  However,  in  practice,  it  is  extremely  difficult  to  accurately  estimate  the  instantaneous 
frequency  of  a  signal  [3]  [4],  and  we  found  that  labeling  T-F  units  based  on  estimated  instantaneous 
frequency  does  not  perform  better  than  using  the  ACF-based  measures. 

We propose to construct a classifier that combines these two kinds of measure to label T-F units. Let
$f(c,m)$ denote the estimated average instantaneous frequency of the filter response within unit $u_{cm}$. If the
filter response has a period close to $\tau_S(m)$, then $f(c,m) \cdot \tau_S(m)$ is close to an integer greater than or
equal to 1. Similarly, let $f_E(c,m)$ be the estimated average instantaneous frequency of the response
envelope within $u_{cm}$. If the response envelope fluctuates at the period of $\tau_S(m)$, then $f_E(c,m) \cdot \tau_S(m)$ is
close to 1. Let

$$r_{cm}(\tau) = \big(A(c,m,\tau),\ f(c,m)\tau - \mathrm{int}(f(c,m)\tau),\ \mathrm{int}(f(c,m)\tau),\ A_E(c,m,\tau),\ f_E(c,m)\tau - \mathrm{int}(f_E(c,m)\tau),\ \mathrm{int}(f_E(c,m)\tau)\big) \qquad (3)$$

be a set of 6 features, the first 3 of which correspond to the filter response and the last 3 to the response
envelope. In (3), the function $\mathrm{int}(x)$ returns the nearest integer. Let $H_0$ be the hypothesis that a T-F unit is
target dominant and $H_1$ otherwise. $u_{cm}$ is labeled as target if and only if

$$P(H_0 \mid r_{cm}(\tau_S(m))) > P(H_1 \mid r_{cm}(\tau_S(m))) \qquad (4)$$

Since

$$P(H_0 \mid r_{cm}(\tau_S(m))) = 1 - P(H_1 \mid r_{cm}(\tau_S(m))), \qquad (5)$$

Eq. (4) becomes

$$P(H_0 \mid r_{cm}(\tau_S(m))) > 0.5 \qquad (6)$$
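To make the feature set of Eq. (3) and the decision rule of Eq. (6) concrete, the sketch below assembles the six features for one T-F unit and applies the 0.5 threshold to a given probability. The function names, the argument layout and the use of seconds and Hz as units are assumptions for illustration; the probability itself would come from the trained classifier, which is not reproduced here.

```python
import math

def nearest_int(v):
    """Round to the nearest integer (halves round up), in the role of int(.)."""
    return math.floor(v + 0.5)

def unit_features(acf_val, freq, env_acf_val, env_freq, tau):
    """Six features of one T-F unit at a candidate pitch period tau (seconds).

    acf_val, env_acf_val: A(c,m,tau) and A_E(c,m,tau), already looked up.
    freq, env_freq: average instantaneous frequencies of the response and
    of the response envelope, in Hz.
    """
    p = freq * tau
    pe = env_freq * tau
    return (acf_val, p - nearest_int(p), nearest_int(p),
            env_acf_val, pe - nearest_int(pe), nearest_int(pe))

def label_unit(prob_target):
    """Label the unit 1 when P(H0 | features) exceeds 0.5, else 0."""
    return 1 if prob_target > 0.5 else 0

# A response at the 3rd harmonic of a 200 Hz pitch (tau = 5 ms):
feats = unit_features(0.9, 600.0, 0.7, 200.0, 0.005)
```

In this example the product of frequency and period is close to the integer 3 for the response and close to 1 for the envelope, so the two fractional features are near zero, as expected for a unit that matches the pitch.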

In  this  study,  we  estimate  the  instantaneous  frequency  of  the  response  within  a  T-F  unit  simply  as  half 
the  inverse  of  the  interval  between  zero-crossings  of  the  response  [4],  assuming  that  the  response  is 
approximately  sinusoidal.  Note  that  a  sinusoidal  function  crosses  zero  twice  within  a  period. 
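The zero-crossing estimate described above can be sketched as follows. Averaging the intervals over the unit and refining the crossing times by linear interpolation are added assumptions, as are the function name and the 16-kHz sampling rate.

```python
import math

def zero_crossing_frequency(x, fs):
    """Average instantaneous frequency (Hz) from zero-crossing intervals.

    Each interval between consecutive zero-crossings is taken as half a
    period, so the estimate is 1 / (2 * mean interval).
    Returns 0.0 when fewer than two crossings are found.
    """
    times = []
    for n in range(1, len(x)):
        if (x[n - 1] < 0) != (x[n] < 0):  # sign change between samples
            frac = x[n - 1] / (x[n - 1] - x[n])  # linear interpolation
            times.append((n - 1 + frac) / fs)
    if len(times) < 2:
        return 0.0
    mean_interval = (times[-1] - times[0]) / (len(times) - 1)
    return 1.0 / (2.0 * mean_interval)

# A 20-ms frame of a 600 Hz sinusoid sampled at 16 kHz.
fs = 16000
x = [math.sin(2 * math.pi * 600 * t / fs + 0.3) for t in range(320)]
f_est = zero_crossing_frequency(x, fs)
```

For a clean sinusoid the estimate is very close to the true frequency; for noisy or multi-component responses the sinusoidal assumption holds only approximately.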

For classification, we use a multilayer perceptron (MLP) with one hidden layer to compute
$P(H_0 \mid r_{cm}(\tau))$ for each filter channel. The desired output of the MLP is 1 if the corresponding T-F unit is
target dominant and 0 otherwise (i.e., the IBM). When there are sufficient training samples, the trained
MLP yields a good estimate of $P(H_0 \mid r_{cm}(\tau))$ [7]. In this study, the MLP for each channel is trained with a
corpus that includes all the utterances from the training part of the TIMIT database [13] and 100
intrusions. These intrusions include crowd noise and environmental sounds, such as wind, bird chirp, and
ambulance alarm.¹ Utterances and intrusions are mixed at 0 dB SNR to generate training samples; the
target is a speech utterance and interference is either a nonspeech intrusion or another utterance. We use
Praat [5] to estimate target pitch. The number of units in the hidden layer is determined using
cross-validation. Specifically, we divide the training samples equally into two sets, one for training and the
other for validation. The number of units in the hidden layer is chosen to be the minimum such that
adding more units in the hidden layer will not yield any significant performance improvement on the
validation set. Since most obtained MLPs have 5 units in their hidden layers, we let every MLP have 5
hidden units for uniformity.

¹ The intrusions are posted at http://www.cse.ohio-state.edu/pnl/corpus/HuCorpus.html

Figs. 1(a) and 1(b) show sample ACFs of a filter response and the response envelope in a T-F unit.
The input is a female utterance, “That noise problem grows more annoying each day,” from the TIMIT
database. This unit corresponds to a channel with a center frequency of 2.5 kHz and a time frame from
790 ms to 810 ms. Fig. 1(c) shows the corresponding $P(H_0 \mid r_{cm}(\tau))$ for different $\tau$ values. The maximum of
$P(H_0 \mid r_{cm}(\tau))$ is located at 5.87 ms, the pitch period of the utterance at this frame.


Figure 1. (a) ACF of the filter response within a T-F unit in a channel centered at 2.5 kHz. (b)
Corresponding ACF of the response envelope. (c) Probability of the unit being target
dominant given target pitch period $\tau$.


The obtained MLPs are used to label individual T-F units according to Eq. (6). Fig. 2(a) shows the
resulting error rate by channel for all the mixtures in a test corpus (see Sect. V.B). The error rate is the
average of false acceptance and false rejection. As shown in the figure, with features derived from
individual T-F units, we can label about 70%-90% of the units correctly across the whole frequency
range. In general, T-F units in the low-frequency range are labeled more accurately than those in the
high-frequency range. Fig. 2 also shows the error rate obtained by using only subsets of the features from the
feature set, $r_{cm}(\tau)$. As shown in this figure, the ACF values at the pitch point and instantaneous frequencies provide
complementary information. The response envelope is more indicative than the response itself in the
high-frequency range. Best results are obtained when all 6 features are used.

Figure 2. Error percentage in T-F unit labeling using different subsets of 6 features (see text for
definitions) given target pitch. (a) Comparison between all 6 features. (b) Comparison between
the first 3 features. (c) Comparison between the last 3 features.

Besides using MLPs, we have considered modeling the distribution of $r_{cm}(\tau)$ using a Gaussian mixture
model as well as a support vector machine based classifier [15]. However, the results are not better than
those obtained with the MLPs.

B.  Multiple  Harmonic  Sources 

When  interference  contains  one  or  several  harmonic  signals,  there  are  time  frames  where  both  target 
and  interference  are  pitched.  In  such  a  situation,  it  is  more  reliable  to  label  a  T-F  unit  by  comparing  the 
period  of  the  signal  within  the  unit  with  both  the  target  pitch  period  and  the  interference  pitch  period.  In 
particular, $u_{cm}$ should be labeled as target if the target period not only matches the period of the signal but
also matches better than the interference period, i.e.,

$$\begin{cases} P(H_0 \mid r_{cm}(\tau_S(m))) > P(H_1 \mid r_{cm}(\tau'_S(m))) \\ P(H_0 \mid r_{cm}(\tau_S(m))) > 0.5 \end{cases} \qquad (7)$$

where $\tau'_S(m)$ is the pitch period of the interfering sound at frame $m$. We use Eq. (7) to label T-F units for
all  the  mixtures  of  two  utterances  in  the  test  corpus.  Both  target  pitch  and  interference  pitch  are  obtained 
by  applying  Praat  to  clean  utterances.  Fig.  3  shows  the  corresponding  error  rate  by  channel,  compared 
with  using  only  the  target  pitch  to  label  T-F  units.  As  shown  in  the  figure,  better  performance  is  obtained 
by  using  the  pitch  values  of  both  speakers. 

Figure 3. Percentage of error in T-F unit labeling for two-voice mixtures using target
pitch only or both target and interference pitch.

C. Unit Labeling with Information from a Neighborhood of T-F Units

Labeling a T-F unit using only the local information within the unit still produces a significant amount
of error. Since speech signals are wideband and exhibit temporal continuity, neighboring T-F units
potentially provide useful information for unit labeling. For example, a T-F unit surrounded by
target-dominant units is also likely target dominant. Therefore, we consider information from a local context.
Specifically, we label $u_{cm}$ as target if

$$P\big(H_0 \mid \{P(H_0 \mid r_{c'm'}(\tau_S(m')))\}\big) > 0.5, \quad |c'-c| < N_c,\ |m'-m| < N_m \qquad (8)$$

where $N_c$ and $N_m$ define the size of the neighborhood along frequency and time, respectively, and
$\{P(H_0 \mid r_{c'm'}(\tau_S(m')))\}$ is the vector that contains the $P(H_0 \mid r_{cm}(\tau_S(m)))$ values of T-F units within the
neighborhood. Again, for each frequency channel, we train an MLP with one hidden layer to calculate the
probability $P\big(H_0 \mid \{P(H_0 \mid r_{c'm'}(\tau_S(m')))\}\big)$ using the $P(H_0 \mid r_{cm}(\tau_S(m)))$ values within the neighborhood as
features.

The key here is to determine the appropriate size of a neighborhood. Again, we divide the training
samples equally into two sets and use cross-validation to determine $N_c$ and $N_m$. This cross-validation
procedure suggests that $N_c = 8$ and $N_m = 2$ define an appropriate size of the neighborhood. By utilizing
information from neighboring channels and frames, we reduce the average percentage of false rejection
across all channels from 20.8% to 16.7% and the average percentage of false acceptance from 13.3% to
8.7% for the test corpus. The hidden layer of such a trained MLP has 2 units, also determined by
cross-validation. Note that when both target and interference are pitched, we label a T-F unit according to Eq.
(7) with probabilities $P\big(H_0 \mid \{P(H_0 \mid r_{c'm'}(\tau_S(m')))\}\big)$ and $P\big(H_1 \mid \{P(H_1 \mid r_{c'm'}(\tau'_S(m')))\}\big)$.

Since $P(H_0 \mid r_{cm}(\tau_S(m)))$ is derived from $r_{cm}(\tau_S(m))$, we have also considered using $r_{cm}(\tau_S(m))$ directly as
features. The resulting MLPs are much more complicated, but yield no performance gain.
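The neighborhood feature vector used in Eq. (8) can be gathered as in the following sketch, which follows the strict inequalities as printed in that equation. The function name, the nested-list layout `prob[c][m]` and the zero padding at the borders of the T-F grid are assumptions; the text does not state how border units are handled.

```python
def neighborhood_features(prob, c, m, n_c, n_m, pad=0.0):
    """Collect unit-level probabilities for |c'-c| < n_c and |m'-m| < n_m.

    prob[c][m] holds P(H0 | r_cm(tau_S(m))) for every T-F unit.
    Units outside the grid are filled with `pad` (an assumption).
    """
    n_chan, n_frames = len(prob), len(prob[0])
    feats = []
    for cc in range(c - n_c + 1, c + n_c):
        for mm in range(m - n_m + 1, m + n_m):
            inside = 0 <= cc < n_chan and 0 <= mm < n_frames
            feats.append(prob[cc][mm] if inside else pad)
    return feats

# With N_c = 8 and N_m = 2 the vector covers 15 channels x 3 frames.
prob = [[0.5 for _ in range(10)] for _ in range(128)]
feats = neighborhood_features(prob, c=0, m=5, n_c=8, n_m=2)
```

The resulting vector would then be the input of the second, per-channel MLP described above.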

IV.  Pitch  Determination  Given  Target  Mask 

A.  Integration  across  Channels 

Given an estimated mask of the voiced target, the task here is to estimate target pitch. Let $L(m) =
\{L(c,m), \forall c\}$ be the set of binary mask labels at frame $m$, where $L(c,m)$ is 1 if $u_{cm}$ is active and 0
otherwise. A frequently used method for pitch determination is to pool autocorrelations across all the
channels and then identify a dominant peak in the summary correlogram, the summation of ACFs across
all the channels [11]. The estimated pitch period at frame $m$, $\tau_S(m)$, is the lag corresponding to the
maximum of the summary ACF in the plausible pitch range. This simple method of pitch estimation is not
very robust when interference is strong because the autocorrelations in many channels exhibit spurious
peaks not corresponding to the target period. One may solve this problem by removing
interference-dominant T-F units, i.e., calculating the summary correlogram only with active T-F units:

$$A(m,\tau) = \sum_c A(c,m,\tau)\, L(c,m) \qquad (9)$$

Similar to the ACF of the filter response, the profile of the probability that unit $u_{cm}$ is target dominant
given pitch period $\tau$, $P(H_0 \mid r_{cm}(\tau))$, also tends to have a significant peak at the target period when $u_{cm}$ is
truly target dominant (see Fig. 1(c)). One can use the corresponding summation of $P(H_0 \mid r_{cm}(\tau))$,

$$SP_m(\tau) = \sum_c P(H_0 \mid r_{cm}(\tau))\, L(c,m), \qquad (10)$$

to identify the pitch period at frame $m$ as the maximum of the summation in the plausible pitch range.
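Both summaries, Eqs. (9) and (10), share the same structure: a per-channel profile over lags is summed over active channels only, and the pitch period is the lag of the maximum. A minimal sketch with made-up numbers follows; the function names and the toy values are assumptions.

```python
def masked_summary(values, mask_m):
    """Sum per-channel lag profiles over active channels only.

    values[c][lag]: A(c,m,lag) or P(H0 | r_cm(lag)) for one frame m.
    mask_m[c]: binary label L(c,m).
    """
    n_lags = len(values[0])
    return [sum(values[c][k] for c in range(len(values)) if mask_m[c])
            for k in range(n_lags)]

def pick_pitch_lag(summary, min_lag, max_lag):
    """Lag of the largest summary value within [min_lag, max_lag]."""
    return max(range(min_lag, max_lag + 1), key=lambda k: summary[k])

# Channel 1 (interference) peaks at lag 3 but is masked out.
values = [[0.0, 0.1, 0.2, 0.1, 0.9, 0.2],
          [0.0, 0.1, 0.2, 2.0, 0.1, 0.1],
          [0.0, 0.2, 0.1, 0.2, 0.8, 0.3]]
lag_masked = pick_pitch_lag(masked_summary(values, [1, 0, 1]), 1, 5)
lag_all = pick_pitch_lag(masked_summary(values, [1, 1, 1]), 1, 5)
```

With the interference-dominant channel removed, the maximum moves from the spurious lag to the lag shared by the two target-dominant channels.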

We apply the above two methods for pitch estimation to two utterances from the test corpus, one from
a female speaker and the other from a male speaker. These two utterances are mixed with 20 intrusions at
0 dB SNR. In this estimation, we use the IBM at the voiced frames of the target utterance to estimate a
pitch period at each frame. The percentages of estimation error for the two methods are shown in the first
row of Table 1. We use the pitch contours obtained by applying Praat to the clean target as the ground
truth of the target pitch. An error occurs when the estimated pitch period and the pitch period obtained
from Praat differ by more than 5%. As shown in the table, using the summation of $P(H_0 \mid r_{cm}(\tau))$ performs
much better than using the summary ACF for the female utterance. Both methods, especially the one
using the summary ACF, perform better on the male utterance. This is because the ACF and $P(H_0 \mid r_{cm}(\tau))$
in target-dominant T-F units all exhibit peaks not only at the target pitch period, but also at time lags
that are integer multiples of the pitch period. As a result, their summations have significant peaks not only at the target pitch
period, but also at its integer multiples, especially for a female voice, making pitch estimation difficult.

Table 1. Error rate (%) of different pitch estimation methods given the ideal binary mask
(F: female utterance; M: male utterance).

Method                         Summary ACF      Summary P(H0|rcm(τ))   Classifier
                               F       M        F       M              F       M
Without temporal continuity    39.6    17.1     18.1    17.2           15.6    17.6
With temporal continuity       31.8    16.3     14.8    15.8           12.7    16.8

B.  Differentiating  True  Pitch  Period  from  Its  Integer  Multiples 

To differentiate a target pitch period from its integer multiples for pitch estimation, we need to take the
relative locations of possible pitch candidates into consideration. Let $\tau_1$ and $\tau_2$ be two pitch candidates.
We train an MLP-based classifier that selects the better one from these two candidates using their relative
locations and $SP_m(\tau)$ as features, i.e., $(\tau_1/\tau_2,\ SP_m(\tau_1),\ SP_m(\tau_2))$. In constructing the training data, we
obtain $SP_m(\tau)$ at each time frame from all the target-dominant T-F units. In each training sample, the two
pitch candidates are the true target pitch period and the lag of another peak of $SP_m(\tau)$ within the plausible
pitch range. Without loss of generality, we let $\tau_1 < \tau_2$. The desired output is 1 if $\tau_1$ is the true pitch period
and 0 otherwise. The obtained MLP has 3 units in the hidden layer. We use the obtained MLP to select
the better one from the two candidates as follows: if the output of the MLP is higher than 0.5, we consider
$\tau_1$ as the better candidate; otherwise, we consider $\tau_2$ as the better candidate.

The  target  pitch  is  estimated  with  the  classifier  as  follows: 

•  Find all the local maxima in $SP_m(\tau)$ within the plausible pitch range as pitch candidates.
Sort these candidates according to their time lags from small to large and let the first
candidate be the current estimated pitch period, $\tau_S(m)$.

•  Compare the current estimated pitch period with the next candidate using the obtained
MLP and update the pitch estimate if necessary.
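The two steps above can be sketched as a simple scan over sorted candidates. The sketch assumes that the comparison is repeated for every subsequent candidate, and it replaces the trained MLP with a hypothetical callable `prefers_first`; the `toy_rule` below is only a stand-in so that the example runs, not the trained classifier.

```python
def local_maxima(sp, min_lag, max_lag):
    """Lags in [min_lag, max_lag] where sp is strictly above both neighbors.

    Assumes 1 <= min_lag and max_lag <= len(sp) - 2.
    """
    return [k for k in range(min_lag, max_lag + 1)
            if sp[k] > sp[k - 1] and sp[k] > sp[k + 1]]

def select_pitch(sp, min_lag, max_lag, prefers_first):
    """Walk candidates from short to long lag, keeping the pairwise winner.

    prefers_first(ratio, sp1, sp2) stands in for the trained MLP: it returns
    a score above 0.5 when the shorter candidate is the better one.
    Returns 0 when there is no candidate.
    """
    candidates = local_maxima(sp, min_lag, max_lag)
    if not candidates:
        return 0
    current = candidates[0]
    for nxt in candidates[1:]:
        score = prefers_first(current / nxt, sp[current], sp[nxt])
        if score <= 0.5:
            current = nxt
    return current

# Hypothetical stand-in rule: keep the shorter lag unless the longer peak
# is clearly higher.
def toy_rule(ratio, sp1, sp2):
    return 1.0 if sp1 >= 0.8 * sp2 else 0.0

sp = [0, 1, 0, 5, 0, 0, 6, 0, 0, 2, 0]
lag = select_pitch(sp, 1, 9, toy_rule)
```

Here the peak at lag 6 is slightly higher than the peak at lag 3, but the pairwise rule keeps the shorter lag, which is the behavior needed to avoid picking an integer multiple of the true period.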

The  percentage  of  pitch  estimation  errors  with  the  classifier  is  shown  in  the  first  row  in  Table  1.  The 
classifier  reduces  the  error  rate  on  the  female  utterance  but  slightly  increases  the  error  rate  on  the  male 
utterance. 

C.  Pitch  Estimation  Using  Temporal  Continuity 

Speech  signals  exhibit  temporal  continuity,  i.e.,  their  structure,  such  as  frequency  partials,  tends  to  last 
for  a  certain  period  of  time  corresponding  to  a  syllable  or  phoneme,  and  the  signals  change  smoothly 
within  this  period.  Consequently,  the  pitch  and  the  ideal  binary  mask  of  a  target  utterance  tend  to  have 
good temporal continuity as well. We found that less than 0.5% of consecutive frames have more than
20% relative pitch changes for utterances in our training set [15]. Thus we utilize pitch continuity to
further  improve  pitch  estimation  as  follows. 

First, we check the reliability of the estimated pitch based on temporal continuity. Specifically, for
every three consecutive frames, $m-1$, $m$, and $m+1$, if the pitch changes are all less than 20%, i.e.,

$$\begin{cases} |\tau_S(m) - \tau_S(m-1)| < 0.2 \min(\tau_S(m), \tau_S(m-1)) \\ |\tau_S(m) - \tau_S(m+1)| < 0.2 \min(\tau_S(m), \tau_S(m+1)) \end{cases} \qquad (11)$$

the estimated pitch periods in these three frames are all considered reliable.
Second, we re-estimate unreliable pitch points by limiting the plausible pitch range using neighboring
reliable pitch points. Specifically, for two consecutive time frames, $m-1$ and $m$, if $\tau_S(m)$ is reliable and
$\tau_S(m-1)$ is unreliable, we re-estimate $\tau_S(m-1)$ by limiting the plausible pitch range for $\tau_S(m-1)$ to be
$[0.8\tau_S(m),\ 1.2\tau_S(m)]$, and vice versa. Another possible situation is that $\tau_S(m)$ is unreliable while both
$\tau_S(m-1)$ and $\tau_S(m+1)$ are reliable. In this case, we use $\tau_S(m-1)$ to limit the plausible pitch range of $\tau_S(m)$
if the mask at frame $m$ is more similar to the mask at frame $m-1$ than the mask at frame $m+1$, i.e.,

$$\sum_c L(c,m)\, L(c,m-1) > \sum_c L(c,m)\, L(c,m+1); \qquad (12)$$

otherwise, $\tau_S(m+1)$ is used to re-estimate $\tau_S(m)$. Then the re-estimated pitch points are considered as
reliable and used to estimate unreliable pitch points in their neighboring frames. This re-estimation
process stops when all the unreliable pitch points have been re-estimated.
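The reliability check and the choice of reference frame described above can be sketched as follows. Treating a period of 0 as "no pitch", the per-frame mask layout `mask[m][c]`, and the function names are assumptions made for the example; the re-estimation itself (searching the restricted lag range) is only indicated by the computed bounds.

```python
def pitch_close(t1, t2):
    """Change below 20% of the smaller period; 0 is treated as no pitch."""
    return t1 > 0 and t2 > 0 and abs(t1 - t2) < 0.2 * min(t1, t2)

def reliable_frames(tau):
    """Mark frames that sit in a run of three mutually continuous estimates."""
    reliable = [False] * len(tau)
    for m in range(1, len(tau) - 1):
        if pitch_close(tau[m], tau[m - 1]) and pitch_close(tau[m], tau[m + 1]):
            reliable[m - 1] = reliable[m] = reliable[m + 1] = True
    return reliable

def reference_frame(mask, m):
    """Pick the neighbor whose mask overlaps more with the mask at frame m."""
    prev_overlap = sum(a * b for a, b in zip(mask[m], mask[m - 1]))
    next_overlap = sum(a * b for a, b in zip(mask[m], mask[m + 1]))
    return m - 1 if prev_overlap > next_overlap else m + 1

tau = [5.0, 5.1, 5.2, 10.4, 5.3, 5.2, 5.1]  # frame 3 is a doubling error
flags = reliable_frames(tau)
mask = [[1, 1, 0], [1, 1, 0], [1, 1, 0], [1, 1, 1],
        [0, 0, 1], [0, 0, 1], [0, 0, 1]]      # mask[m][c], one row per frame
ref = reference_frame(mask, 3)
lo, hi = 0.8 * tau[ref], 1.2 * tau[ref]       # restricted range for frame 3
```

In this toy case the doubled period at frame 3 is flagged as unreliable, and its search range is restricted around the estimate of the neighboring frame whose mask it overlaps with more.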

The second row in Table 1 shows the effect of incorporating temporal continuity in pitch estimation
with  the  methods  described  above.  Using  temporal  continuity  yields  consistent  performance 
improvement,  especially  for  the  female  utterance. 

V.  Iterative  Procedure 

Our  tandem  algorithm  first  generates  an  initial  estimate  of  pitch  contours  and  binary  masks  for  up  to 
two  sources.  It  then  improves  the  estimation  of  pitch  contours  and  masks  in  an  iterative  manner. 

A.  Initial  Estimation 

In  this  step,  we  first  generate  up  to  two  estimated  pitch  periods  in  each  time  frame.  Since  T-F  units 
dominated  by  a  periodic  signal  tend  to  have  high  cross-channel  correlations  of  the  filter  response  or  the 
response  envelope,  we  only  consider  T-F  units  with  high  cross-channel  correlations  in  this  estimation.  Let 
$\tau_{S,1}(m)$ and $\tau_{S,2}(m)$ represent two estimated pitch periods at frame $m$, and $L_1(m)$ and $L_2(m)$ the
corresponding labels of the estimated masks. We first treat all the T-F units with high cross-channel
correlations as dominated by a single source. That is:

$$L_1(c,m) = \begin{cases} 1 & C(c,m) > 0.985 \text{ or } C_E(c,m) > 0.985 \\ 0 & \text{else} \end{cases} \qquad (13)$$

We then assign the time delay supported by the most active T-F units as the first estimated pitch period. A
unit $u_{cm}$ is considered to support a pitch candidate $\tau$ if the corresponding $P(H_0 \mid r_{cm}(\tau))$ is higher than a
threshold. Accordingly we have:

$$\tau_{S,1}(m) = \arg\max_{\tau} \sum_c L_1(c,m) \cdot \mathrm{sgn}\big(P(H_0 \mid r_{cm}(\tau)) - \theta_P\big) \qquad (14)$$

where

$$\mathrm{sgn}(x) = \begin{cases} 1 & x > 0 \\ 0 & x = 0 \\ -1 & x < 0 \end{cases}$$

and $\theta_P$ is a threshold. Intuitively, we can set $\theta_P$ to 0.5. However, such a threshold may not position the
estimated pitch period close to the true pitch period because $P(H_0 \mid r_{cm}(\tau))$ tends to be higher than 0.5 in a
relatively wide range centered at the true pitch period (see Fig. 1(c)). In general $\theta_P$ needs to be much
higher than 0.5 so that we can position $\tau_{S,1}(m)$ accurately. On the other hand, $\theta_P$ cannot be too high,
otherwise most active T-F units cannot contribute to this estimation. We found that 0.75 is a good
compromise that allows us to accurately position $\tau_{S,1}(m)$ without ignoring many active T-F units.
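The voting rule of Eq. (14) can be written out as below: each active unit contributes +1 to a lag when its probability exceeds the threshold and -1 when it falls below. The function names, the list layout and the toy probabilities are assumptions; in practice the lag axis would be limited to the plausible pitch range.

```python
def sgn(v):
    """Sign function: 1, 0 or -1."""
    return (v > 0) - (v < 0)

def first_pitch_estimate(prob, active, theta_p=0.75):
    """Lag maximizing the signed vote of active units for one frame.

    prob[c][lag]: P(H0 | r_cm(lag)); active[c]: initial mask label L1(c,m).
    Returns (best_lag, number_of_supporting_units).
    """
    n_lags = len(prob[0])

    def vote(k):
        return sum(sgn(prob[c][k] - theta_p)
                   for c in range(len(prob)) if active[c])

    best = max(range(n_lags), key=vote)
    support = sum(1 for c in range(len(prob))
                  if active[c] and prob[c][best] > theta_p)
    return best, support

prob = [[0.1, 0.8, 0.9, 0.6],
        [0.2, 0.7, 0.95, 0.8],
        [0.9, 0.1, 0.8, 0.1],
        [0.9, 0.9, 0.1, 0.9]]   # last channel is inactive
best, support = first_pitch_estimate(prob, active=[1, 1, 1, 0])
```

The returned support count is the quantity that would be compared against a minimum number of supporting channels when spurious pitch points are discarded.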

The  above  process  yields  an  estimated  pitch  at  many  time  frames  where  the  target  is  not  pitched.  The 
estimated  pitch  point  at  such  a  frame  is  usually  supported  by  only  a  few  T-F  units  unless  the  interference 
contains  a  strong  harmonic  signal  at  this  frame.  On  the  other  hand,  estimated  pitch  points  corresponding 
to  target  pitch  are  usually  supported  by  many  T-F  units.  In  order  to  remove  spurious  pitch  points,  we 
discard  a  detected  pitch  point  if  the  total  number  of  channels  supporting  this  pitch  point  is  less  than  a 
threshold.  We  found  that  an  appropriate  threshold  is  7  from  analyzing  the  training  data  set  (see  Sect. 
III.A).  Most  spurious  pitch  points  are  thus  removed.  At  the  same  time,  some  true  pitch  points  are  also 
removed,  but  most  of  them  will  be  recovered  in  the  following  iterative  process. 

With  the  estimated  pitch  period  Tsj(m),  we  re-estimate  the  mask  L\(m)  as: 
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L\(c,tri)  = 


1  P(Ho\rcm(TsAm)))>0-5 
0  else 


Then  we  use  the  T-F  units  that  do  not  support  the  first  pitch  period  Ts,\(jri)  to  estimate  the  second  pitch 
period,  rS;2(»i).  Specifically, 


fi  P(H0  |  rcm(rS  i (m)))  <  dP  and  ( C(c,m )  >  0.985  or  CE(c,m )  >  0.985) 

L2(c,m)  =  < (16) 
0  else 


We  let 


zsAm)  =  argmax^  4(0, m)  ■  sgn(P(H0  \  rcm(r))  -  GP) 


Again, if fewer than 7 T-F units support $\tau_{S,2}(m)$, we set it to 0. Otherwise, we re-estimate $L_2(m)$ as:

$$L_2(c,m) = \begin{cases} 1 & \text{if } P(H_0 \mid r_{cm}(\tau_{S,2}(m))) > 0.5 \\ 0 & \text{else} \end{cases}$$


Here we estimate up to two pitch points at one frame; one can easily extend the above algorithm to
estimate pitch points of more sources if needed.
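
As an illustration of how the units for the second pitch period are selected in Eq. (16), the following Python sketch builds the candidate mask for one frame. The function name and argument layout are hypothetical; `corr` and `env_corr` stand for the quantities written C(c,m) and C_E(c,m) in the text, whose computation is not shown here.

```python
def candidate_mask_for_second_pitch(p_tau1, corr, env_corr,
                                    theta_p=0.75, corr_thresh=0.985):
    """Select units for the second pitch estimate in one frame.

    p_tau1[c]   -- P(H0 | r_cm(tau_S1(m))) for channel c
    corr[c]     -- C(c, m) for channel c (assumed precomputed)
    env_corr[c] -- C_E(c, m) for channel c (assumed precomputed)
    Returns a list of 0/1 labels, one per channel.
    """
    mask = []
    for p, c, ce in zip(p_tau1, corr, env_corr):
        # The unit must not support the first pitch period ...
        unsupported = p < theta_p
        # ... and must exceed the correlation threshold on C or C_E.
        correlated = c > corr_thresh or ce > corr_thresh
        mask.append(1 if unsupported and correlated else 0)
    return mask
```

A unit that already supports the first pitch period is never selected, so the second estimate is driven only by units left unexplained by the first one.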

After the above estimation, our algorithm combines the estimated pitch periods into pitch contours
based on temporal continuity. Specifically, for estimated pitch periods in three consecutive frames,
$\tau_{S,k_1}(m-1)$, $\tau_{S,k_2}(m)$, and $\tau_{S,k_3}(m+1)$, where $k_1$, $k_2$, and $k_3$ are either 1 or 2, they are combined into one
pitch contour if they have good temporal continuity and their associated masks also have good temporal
continuity. That is,

$$\begin{aligned}
|\tau_{S,k_2}(m) - \tau_{S,k_1}(m-1)| &< 0.2 \min\big(\tau_{S,k_2}(m), \tau_{S,k_1}(m-1)\big) \\
|\tau_{S,k_2}(m) - \tau_{S,k_3}(m+1)| &< 0.2 \min\big(\tau_{S,k_2}(m), \tau_{S,k_3}(m+1)\big) \\
\sum_c L_{k_2}(c,m)\, L_{k_1}(c,m-1) &> 0.5 \max\Big(\sum_c L_{k_2}(c,m), \sum_c L_{k_1}(c,m-1)\Big) \\
\sum_c L_{k_2}(c,m)\, L_{k_3}(c,m+1) &> 0.5 \max\Big(\sum_c L_{k_2}(c,m), \sum_c L_{k_3}(c,m+1)\Big)
\end{aligned} \tag{19}$$

The remaining isolated estimated pitch points are considered unreliable and set to 0. Note that requiring
only the temporal continuity of pitch periods cannot prevent connecting pitch points from different
sources, since the target and interference may have similar pitch periods at the same time. However, it is
very unlikely that the target and interference have similar pitch periods and occupy the same frequency
region at the same time. In most situations, pitch points that are connected according to (19) do
correspond to a single source. As a result of this step, we obtain multiple pitch contours and each pitch
contour has an associated T-F mask.
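
The continuity test of (19) can be sketched in Python as follows. This is a minimal illustration with hypothetical function names; each frame is represented as a pair of a pitch period and a binary mask over channels, which is an assumed data layout rather than the one used in the original system.

```python
def periods_continuous(tau_a, tau_b, tol=0.2):
    """Pitch periods differ by less than 20% of the smaller one."""
    return abs(tau_a - tau_b) < tol * min(tau_a, tau_b)


def masks_continuous(mask_a, mask_b, ratio=0.5):
    """Shared channels exceed half the size of the larger mask."""
    overlap = sum(a * b for a, b in zip(mask_a, mask_b))
    return overlap > ratio * max(sum(mask_a), sum(mask_b))


def forms_contour(prev, cur, nxt):
    """prev, cur, nxt are (tau, mask) pairs for frames m-1, m, m+1."""
    return (periods_continuous(cur[0], prev[0])
            and periods_continuous(cur[0], nxt[0])
            and masks_continuous(cur[1], prev[1])
            and masks_continuous(cur[1], nxt[1]))
```

Note that a pitch period of 0 (no estimate) can never pass `periods_continuous`, because the right-hand side of the strict inequality is then 0. Two sources with similar pitch periods but masks in different frequency regions fail `masks_continuous`, which is the case the mask condition is meant to catch.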


B.  Iterative  Estimation 

In this step, we first re-estimate each pitch contour from its associated binary mask. A key step in this
estimation is to expand estimated pitch contours based on temporal continuity, i.e., using reliable pitch
points to estimate potential pitch points at neighboring frames. Specifically, let $\tau_k$ be a pitch contour and
$L_k(m)$ the associated mask. Let $m_1$ and $m_2$ be the first and the last frame of this pitch contour. To expand
$\tau_k$, we first let $L_k(m_1 - 1) = L_k(m_1)$ and $L_k(m_2 + 1) = L_k(m_2)$. Then we re-estimate $\tau_k$ from this new mask
using the algorithm described in Sect. IV.B. Re-estimated pitch periods are further verified according to
temporal continuity as described in Sect. IV.C, except that we use Eq. (19) instead of Eq. (11) for
continuity verification. If the corresponding source of contour $\tau_k$ is pitched at frame $m_1 - 1$, our algorithm
likely yields an accurate pitch estimate at this frame. Otherwise, the re-estimated pitch period at this
frame usually cannot pass the continuity check, and as a result it is discarded and $\tau_k$ still starts from frame
$m_1$. The same applies to the estimated pitch period at frame $m_2 + 1$. After expansion and re-estimation, two
pitch contours may have the same pitch period at the same frame, in which case they are combined into
one pitch contour.
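
The mask expansion at the two ends of a contour can be illustrated with the following Python sketch. The dictionary representation and the boundary handling (no expansion beyond the first or last frame of the signal) are assumptions made for the example.

```python
def expand_mask(mask_by_frame, m1, m2, n_frames):
    """Copy the boundary masks of a contour to the adjacent frames.

    mask_by_frame -- dict mapping frame index to a list of 0/1 channel
                     labels, defined for frames m1..m2
    n_frames      -- total number of frames in the signal (assumed bound)
    Returns a new dict that also covers frames m1-1 and m2+1 when they exist.
    """
    expanded = {m: list(mask) for m, mask in mask_by_frame.items()}
    if m1 - 1 >= 0:
        expanded[m1 - 1] = list(mask_by_frame[m1])
    if m2 + 1 < n_frames:
        expanded[m2 + 1] = list(mask_by_frame[m2])
    return expanded
```

The pitch period would then be re-estimated on the expanded mask and kept at the new frames only if it passes the continuity check.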

Then we re-estimate the mask for each pitch contour as follows. First, we compute the probability of
each T-F unit being dominated by the corresponding source of a pitch contour $k$, $P(H_0 \mid \{P(H_0 \mid r_{c'm'}(\tau_k(m')))\})$,
as described in Sect. III.C. Then we estimate the mask for contour $k$ according to the obtained
probabilities:

$$L_k(c,m) = \begin{cases} 1 & \text{if } k = \arg\max_{k'} P(H_0 \mid \{P(H_0 \mid r_{c'm'}(\tau_{k'}(m')))\}) \text{ and } P(H_0 \mid \{P(H_0 \mid r_{c'm'}(\tau_k(m')))\}) > 0.5 \\ 0 & \text{else} \end{cases} \tag{20}$$
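
The labeling rule in (20) assigns each T-F unit to at most one contour. A minimal Python sketch, assuming the probabilities have already been computed and stored as nested lists (a hypothetical layout), is:

```python
def assign_units(prob_by_contour, threshold=0.5):
    """Assign each T-F unit to the most probable contour, if any.

    prob_by_contour[k][c][m] -- assumed probability that unit (c, m) is
                                dominated by the source of contour k
    Returns masks[k][c][m] with 0/1 labels.
    """
    n_k = len(prob_by_contour)
    n_c = len(prob_by_contour[0])
    n_m = len(prob_by_contour[0][0])
    masks = [[[0] * n_m for _ in range(n_c)] for _ in range(n_k)]
    for c in range(n_c):
        for m in range(n_m):
            # Contour with the highest probability for this unit
            # (ties go to the lowest index, an arbitrary choice here).
            best = max(range(n_k), key=lambda k: prob_by_contour[k][c][m])
            if prob_by_contour[best][c][m] > threshold:
                masks[best][c][m] = 1
    return masks
```

A unit whose highest probability does not exceed 0.5 is left unassigned in every mask.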
Usually the estimation of both pitch and mask converges after a small number of iterations, typically
fewer than 20. Sometimes this iterative procedure runs into a cycle in which both the estimated pitch and
the estimated mask change slightly and cyclically after each iteration. In our implementation, we stop the
procedure when it converges or after 20 iterations.
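
The stopping rule can be sketched as a generic loop in Python. The `update` argument is a hypothetical stand-in for one round of pitch and mask re-estimation, and convergence is taken here to mean that an update leaves the state unchanged, which is an assumption about how convergence is detected.

```python
def iterate_until_stable(state, update, max_iterations=20):
    """Apply update repeatedly until the state stops changing.

    Returns (final_state, iterations_that_changed_state, converged).
    The iteration cap guards against cycles in which the state keeps
    alternating between a few values.
    """
    for iteration in range(1, max_iterations + 1):
        new_state = update(state)
        if new_state == state:
            return state, iteration - 1, True
        state = new_state
    return state, max_iterations, False
```

For example, an update that keeps flipping between two states never converges and is stopped by the cap, which corresponds to the cyclic behavior mentioned above.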

C.  Incorporating  segmentation 

So far, unit labeling does not take into account T-F segmentation, which refers to a stage of
processing that breaks the auditory scene into contiguous T-F regions, each of which contains acoustic
energy mainly from a single sound source [36]. By producing an intermediate level of representation
between individual T-F units and sources, segmentation has been demonstrated to improve segregation
performance [16]. Here, we apply a segmentation step after the iterative procedure stops. Specifically, we
employ a multiscale onset/offset based segmentation algorithm [18] that produces segments enclosed by
detected onsets and offsets. After segments are produced, we form T-segments, which are segments within
individual frequency channels. T-segments strike a reasonable balance between accepting target and
rejecting interference [15] [19]. With the obtained T-segments, we label the T-F units within a T-segment
wholly as target if (a) more than half of the T-segment energy is included in the voiced frames of the target,
and (b) more than half of the T-segment energy in the voiced frames is included in the active T-F units
according to (9). If a T-segment fails to be labeled as the target, we still treat individual active T-F units
as the target.
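
The two-condition labeling rule for a T-segment can be written as a short Python sketch. The function name and the per-frame list layout are hypothetical, and the sketch assumes that "wholly as target" covers every unit of the T-segment, including those outside the voiced frames.

```python
def label_t_segment(energy, voiced, active):
    """Label the units of one T-segment (one channel, consecutive frames).

    energy[i] -- energy of the i-th T-F unit in the T-segment
    voiced[i] -- 1 if the target is voiced at that frame
    active[i] -- 1 if the unit is already labeled active
    Returns a list of 0/1 target labels for the units.
    """
    total = sum(energy)
    voiced_energy = sum(e for e, v in zip(energy, voiced) if v)
    active_voiced_energy = sum(
        e for e, v, a in zip(energy, voiced, active) if v and a
    )
    cond_a = voiced_energy > 0.5 * total                  # condition (a)
    cond_b = active_voiced_energy > 0.5 * voiced_energy   # condition (b)
    if cond_a and cond_b:
        return [1] * len(energy)
    # Otherwise fall back to the unit-level labels.
    return [1 if a else 0 for a in active]
```

When both conditions hold, units that were not active individually are also recovered, which is how T-segments add target energy at the cost of some interference.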

Fig. 4 shows the detected pitch contours for a mixture of the female utterance used in Fig. 1 and crowd
noise at 0 dB SNR. The mixture is illustrated in Fig. 5, where Figs. 5(a) and 5(b) show the cochleagram
and the waveform of the female utterance and Figs. 5(c) and 5(d) the cochleagram and the waveform of
the mixture. In Fig. 4, we use the pitch points detected by Praat from the clean utterance as the ground
truth of the target pitch. As shown in the figure, our algorithm correctly estimates most of the target pitch
points. At the same time, it also yields one pitch contour for interference (the one overlapping with no
target pitch point). Figs. 5(e) and 5(g) show the obtained masks for the target utterance in the mixture
without and with incorporating segmentation, respectively. Comparing the mask in Fig. 5(e) with the
ideal binary mask shown in Fig. 5(i), we can see that our system is able to segregate most voiced portions
of the target without including much interference. These two masks yield similar resynthesized targets in
the voiced intervals, as shown in Figs. 5(f) and 5(j). By using T-segments, the tandem algorithm is able to
recover even more target energy, but at the expense of adding a small amount of the interference, as
shown in Figs. 5(g) and 5(h). Note that the output consists of several pitch contours and their associated
masks. Determining whether a segregated sound is part of the target speech is the task of sequential
grouping [6] [36], which is beyond the scope of this paper. The masks in Fig. 5(e) and Fig. 5(g) are
obtained by assuming perfect sequential grouping.

Figure 4. Estimated pitch contours for the mixture of one female utterance and crowd noise.

VI.  Evaluation 

A.  Pitch  estimation 

We first evaluate the tandem algorithm on pitch determination with utterances from the FDA
Evaluation Database [1]. This database was collected for evaluating pitch determination algorithms and
provides accurate target pitch contours derived from laryngograph data. The database contains utterances
from two speakers, one male and one female. We randomly select one sentence that is uttered by both
speakers. These two utterances are mixed with a set of 20 intrusions at different SNR levels. These
intrusions are: N1 - white noise, N2 - rock music, N3 - siren, N4 - telephone, N5 - electric fan, N6 -
clock alarm, N7 - traffic noise, N8 - bird chirp with water flowing, N9 - wind, N10 - rain, N11 -
cocktail party noise, N12 - crowd noise at a playground, N13 - crowd noise with music, N14 - crowd
noise with clap, N15 - babble noise (16 speakers), N16-N20 - 5 different utterances (see [15] for details).
These intrusions have a considerable variety: some are noise-like (N9, N11) and some contain strong
harmonic sounds (N3, N5). They form a reasonable corpus for testing the capacity of a CASA system in
dealing with various types of interference.

Figure 5. Segregation illustration. (a) Cochleagram of a female utterance showing the energy of each
T-F unit with brighter pixel indicating stronger energy. (b) Waveform of the utterance. (c)
Cochleagram of the utterance mixed with a crowd noise. (d) Waveform of the mixture. (e) Mask of
segregated voiced target where 1 is indicated by white and 0 by black. (f) Waveform of the target
resynthesized with the mask in (e). (g) Mask of the target segregated after using T-segments. (h)
Waveform of the target resynthesized with the mask in (g). (i) Ideal binary mask. (j) Waveform
resynthesized from the IBM in (i).

Fig.  6(a)  shows  the  average  correct  percentage  of  pitch  determination  with  the  tandem  algorithm  on 
these  mixtures  at  different  SNR  levels.  In  calculating  the  correct  detection  percentage,  we  only  consider 
estimated  pitch  contours  that  match  the  target  pitch:  an  estimated  pitch  contour  matches  target  pitch  if  at 
least  half  of  its  pitch  points  match  the  target  pitch,  i.e.,  the  target  is  pitched  at  these  corresponding  frames 
and  the  estimated  pitch  periods  differ  from  the  true  target  pitch  periods  by  less  than  5%.  As  shown  in  the 
figure,  the  tandem  algorithm  is  able  to  detect  69.1%  of  target  pitch  even  at  -5  dB  SNR.  The  correct 
detection  rate  increases  to  about  83.8%  as  the  SNR  increases  to  15  dB.  In  comparison,  Fig.  6(a)  also 
shows  the  results  using  Praat  and  from  a  multiple  pitch  tracking  algorithm  by  Wu  et  al.  [37],  which 
produces  competitive  performance  [21]  [22].  Note  that  the  Wu  et  al.  algorithm  does  not  yield  continuous 
pitch  contours.  Therefore,  the  correct  detection  rate  is  computed  by  comparing  estimated  pitch  with  the 
ground  truth  frame  by  frame.  As  shown  in  the  figure,  the  tandem  algorithm  performs  consistently  better 
than the Wu et al. algorithm at all SNR levels. The tandem algorithm is also more robust to interference
than Praat, whose performance is good at SNR levels above 10 dB but drops quickly as SNR
decreases.
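
The matching criterion and the resulting detection percentage can be sketched in Python as follows. This is one reading of the evaluation protocol, not the authors' scoring code: the function names are hypothetical, the 5% tolerance is taken relative to the true pitch period, and the detection percentage is computed as the share of pitched target frames that are correctly covered by contours matching the target.

```python
def point_matches(estimated, true, tolerance=0.05):
    """Target is pitched (true > 0) and the estimate is within 5% of it."""
    return true > 0 and estimated > 0 and abs(estimated - true) < tolerance * true


def contour_matches_target(contour, truth):
    """contour maps frame index to estimated period; truth lists the true
    period per frame, with 0 meaning the target is not pitched."""
    hits = sum(point_matches(p, truth[m]) for m, p in contour.items())
    return 2 * hits >= len(contour)  # at least half of the points match


def detection_percentage(contours, truth):
    """Percentage of pitched target frames detected by matching contours.
    Assumes the target is pitched in at least one frame."""
    pitched = [m for m, t in enumerate(truth) if t > 0]
    detected = set()
    for contour in contours:
        if contour_matches_target(contour, truth):
            detected.update(
                m for m, p in contour.items() if point_matches(p, truth[m])
            )
    return 100.0 * len(detected) / len(pitched)
```

A contour with fewer than half of its points matching the target contributes nothing to the detection percentage, even if some of its individual points happen to be correct.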

Besides the detection rate, we also need to measure how well the system separates pitch points of
different sources. Fig. 6(b) shows the percentage of mismatch, which is the percentage of estimated pitch
points that do not match the target pitch among pitch contours matching the target pitch. An estimated
pitch point is counted as a mismatch if either the target is not pitched at the corresponding frame or the
difference between the estimated pitch period and the true period is more than 5%. As shown in the figure,
the tandem algorithm yields a low percentage of mismatch, which is slightly lower than that of Praat
when the SNR is above 5 dB. At lower SNR levels, Praat has a lower percentage of mismatch
because it detects fewer pitch points. Note that the Wu et al. algorithm does not generate pitch contours,
so its mismatch rate is 0. In addition, Fig. 6(c) shows the average number of estimated pitch contours
that match target pitch contours. The actual average number of target pitch contours is 5. The tandem
algorithm yields an average of 5.6 pitch contours for each mixture. This shows that the algorithm
separates target and interference pitch well without dividing the former into many short contours. Praat yields
almost the same number of contours as the actual ones at 15 dB SNR. However, it detects fewer contours
when the mixture SNR drops. Overall, the tandem algorithm yields better performance than either Praat
or the Wu et al. algorithm, especially at low SNR levels.

Figure 6. Results of pitch determination for different algorithms. (a) Percentage of correct
detection. (b) Percentage of mismatch. (c) Number of contours that match the target pitch.

To  illustrate  the  advantage  of  the  iterative  process  for  pitch  estimation,  we  present  the  average 
percentage  of  correct  detection  for  the  above  mixtures  at  -5  dB  with  respect  to  the  number  of  iterations  in 
the first row of Table 2. Here iteration 0 corresponds to the result of initial estimation, and “convergence”
corresponds  to  the  final  output  of  the  algorithm.  As  shown  in  the  table,  the  initial  estimation  already  gives 
a  good  pitch  estimate.  The  iterative  procedure,  however,  is  able  to  improve  the  detection  rate,  especially 
in  the  first  iteration.  Overall,  the  procedure  increases  the  detection  rate  by  6.1  percentage  points.  It  is 
worth  pointing  out  that  the  improvement  varies  considerably  among  different  mixtures,  and  the  largest 
improvement  is  22.1  percentage  points. 

Table 2. Performance of the tandem algorithm with respect to the number of iterations

Iteration No.              0      1      2      3      4      Convergence
Percentage of detection    63.0   66.3   67.8   68.8   68.9   69.1
SNR (dB)                   6.97   7.44   7.62   7.77   7.89   8.04
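
As a quick arithmetic check on Table 2, the following Python snippet recomputes the overall gains quoted in the text (6.1 percentage points in detection and 1.07 dB in SNR) and the gain contributed by each step. The values are copied from the table; the helper names are illustrative only.

```python
# Values copied from Table 2 (iterations 0-4, then convergence).
detection = [63.0, 66.3, 67.8, 68.8, 68.9, 69.1]
snr_db = [6.97, 7.44, 7.62, 7.77, 7.89, 8.04]


def total_gain(values, digits):
    """Difference between the converged value and the initial estimate."""
    return round(values[-1] - values[0], digits)


def stepwise_gains(values, digits):
    """Gain contributed by each successive column of the table."""
    return [round(b - a, digits) for a, b in zip(values, values[1:])]


detection_gain = total_gain(detection, 1)
snr_gain = total_gain(snr_db, 2)
detection_steps = stepwise_gains(detection, 1)
```

The stepwise gains make it easy to see that the first iteration accounts for the largest share of the improvement in detection rate.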

B.  Voiced  Speech  Segregation 

The performance of the system on voiced speech segregation has been evaluated with a test corpus
containing 20 target utterances from the test part of the TIMIT database mixed with the 20 intrusions
described in the previous section.

The estimated target masks are obtained by assuming perfect sequential grouping. Since our
computational goal here is to estimate the IBM, we evaluate segregation performance by comparing the
estimated mask to the IBM with two measures [16].

• The percentage of energy loss, $P_{EL}$, which measures the amount of energy in the active T-F units
that are labeled as interference relative to the total energy in active units.

• The percentage of noise residue, $P_{NR}$, which measures the amount of energy in the inactive T-F
units that are labeled as target relative to the total energy in inactive units.

$P_{EL}$ and $P_{NR}$ provide complementary error measures of a segregation system, and a successful system
needs to achieve low errors in both measures.
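
The two error measures can be computed as in the following Python sketch, which follows the definitions as worded above. It assumes that "active" and "inactive" refer to units labeled 1 and 0 in the IBM, and that a per-unit energy value is available; the function name and the flat list layout are hypothetical.

```python
def energy_loss_and_noise_residue(energy, ibm, estimated):
    """Return (P_EL, P_NR) in percent for flat lists over T-F units.

    energy[i]    -- energy of unit i
    ibm[i]       -- 1 if unit i is active in the ideal binary mask
    estimated[i] -- 1 if unit i is labeled target by the system
    Assumes the IBM contains at least one active and one inactive unit
    with nonzero energy.
    """
    active_total = sum(e for e, i in zip(energy, ibm) if i == 1)
    inactive_total = sum(e for e, i in zip(energy, ibm) if i == 0)
    # Active units wrongly labeled as interference.
    lost = sum(e for e, i, x in zip(energy, ibm, estimated)
               if i == 1 and x == 0)
    # Inactive units wrongly labeled as target.
    residue = sum(e for e, i, x in zip(energy, ibm, estimated)
                  if i == 0 and x == 1)
    return 100.0 * lost / active_total, 100.0 * residue / inactive_total
```

A mask that labels everything as target drives the energy loss to zero but maximizes the noise residue, which is why both measures need to be reported together.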

In addition, to compare waveforms directly we measure the SNR of the segregated voiced target in
decibels [16]:

$$\mathrm{SNR} = 10 \log_{10} \frac{\sum_n s^2(n)}{\sum_n [s(n) - s_V(n)]^2} \tag{21}$$

where $s(n)$ is the target signal resynthesized from the IBM and $s_V(n)$ is the segregated voiced target.
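
A minimal Python sketch of this SNR measure is given below. It assumes the conventional form of the measure, in which the energy of the IBM-resynthesized target is divided by the energy of its difference from the segregated voiced target; the function name and the use of plain lists of samples are illustrative choices.

```python
import math


def segregation_snr_db(reference, segregated):
    """SNR in dB of a segregated signal against a reference signal.

    reference  -- samples of the target resynthesized from the IBM
    segregated -- samples of the segregated voiced target
    Assumes the two signals have equal length and are not identical.
    """
    signal_energy = sum(s * s for s in reference)
    error_energy = sum((s - v) ** 2 for s, v in zip(reference, segregated))
    return 10.0 * math.log10(signal_energy / error_energy)
```

For instance, a segregated signal whose samples are all 10% smaller in amplitude than the reference has an error energy 100 times smaller than the reference energy, which corresponds to 20 dB.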

The results from our system are shown in Fig. 7. Each point in the figure represents the average value
of 400 mixtures in the test corpus at a particular SNR level. Figs. 7(a) and 7(b) show the percentages of
energy loss and noise residue. Note that since our goal here is to segregate voiced target, the $P_{EL}$ values
here are only for the target energy at the voiced frames of the target.

As shown in the figure, our system segregates 78.3% of voiced target energy at -5 dB SNR and 99.2%
at 15 dB SNR. At the same time, 11.2% of the segregated energy belongs to intrusion at -5 dB. This
number drops to 0.6% at 15 dB SNR. Fig. 7(c) shows the SNR of the segregated target. Our system
obtains an average 12.2 dB gain in SNR when the mixture SNR is -5 dB. This gain drops to 3.3 dB when
the mixture SNR is 10 dB. Note that at 15 dB, our system does not improve the SNR because most
unvoiced speech is not segregated. Fig. 7 also shows the result of the algorithm without using
T-segments in the final estimation step (“Neighborhood”). As shown in the figure, the corresponding
segregated target loses more target energy, but contains less interference. The SNR performance is
slightly better when T-segments are incorporated.

Fig. 7 also shows the performance using our previous voiced speech segregation system [16], which is
a representative CASA system. Because the previous system can only track one pitch contour of the
target, in this implementation we provide target pitch estimated by applying Praat to clean utterances. As
shown in the figure, the previous system yields a lower percentage of noise residue, but has a much higher
percentage of energy loss. Even with provided target pitch, the previous system does not perform as well
as the tandem algorithm, especially at higher input SNR levels.

Figure 7. Results of voiced speech segregation. (a) Percentage of energy loss on voiced target. (b)
Percentage of noise residue. (c) SNR of segregated voiced target.

To  illustrate  the  effect  of  iterative  estimation,  we  present  the  average  SNR  for  the  mixtures  of  two 
utterances  and  all  the  intrusions  at  -5  dB  SNR  in  the  second  row  of  Table  2.  On  average,  the  tandem 
algorithm  improves  the  SNR  by  1.07  dB.  Again,  the  SNR  improvement  varies  considerably  among 
different  mixtures,  and  the  largest  improvement  is  7.27  dB. 
Figure 8. SNR results for segregated speech and original mixtures for a corpus of voiced
speech and various intrusions, plotted by intrusion type.

As an additional benchmark, we have evaluated the tandem algorithm on a corpus of 100 mixtures
composed of 10 target utterances mixed with 10 intrusions [10]. Every target utterance in the corpus is
totally voiced and has only one pitch contour. The intrusions have a considerable variety; specifically they
are: N0 - 1 kHz pure tone, N1 - white noise, N2 - noise bursts, N3 - “cocktail party” noise, N4 - rock
music, N5 - siren, N6 - trill telephone, N7 - female speech, N8 - male speech, and N9 - female speech.
The average SNR of the entire corpus is 3.28 dB. This corpus is commonly used in CASA for evaluating
voiced speech segregation [8] [16] [24]. The average SNR for each intrusion is shown in Fig. 8, compared
with those of the original mixtures, our previous system, and a spectral subtraction method. Note that here
our previous system extracts pitch contours from mixtures instead of using pitch contours extracted from
clean utterances with Praat. Spectral subtraction is a standard method for speech enhancement [20] (see
also [16]). The tandem algorithm performs consistently better than spectral subtraction, and better than
our previous system except for N4. On average, the tandem algorithm obtains a 13.4 dB SNR gain, which
is about 1.9 dB better than our previous system and 8.3 dB better than spectral subtraction.
VII. Concluding Remarks


This  study  concentrates  on  voiced  speech,  and  does  not  deal  with  unvoiced  speech.  In  a  recent  paper, 
we  developed  a  model  for  separating  unvoiced  speech  from  nonspeech  interference  on  the  basis  of 
auditory  segmentation  and  feature-based  classification  [19].  This  unvoiced  segregation  system  operates 
on  the  output  of  voiced  speech  segregation,  which  was  provided  by  Hu  and  Wang  [17]  assuming  the 
availability  of  target  pitch  contours.  The  system  in  [17]  is  a  simplified  and  slightly  improved  version  of 
[16]. We have replaced the voiced segregation component of [19] with the tandem algorithm [15]. The
combined  system  produces  segregation  results  for  both  voiced  and  unvoiced  speech  that  are  as  good  as 
those  reported  in  [19],  but  with  detected  pitch  contours  rather  than  a  priori  pitch  contours  (see  [15]  for 
details). 

A natural speech utterance contains silent gaps and other intervals masked by interference. In practice,
one needs to group the utterance across such time intervals. This is the problem of sequential grouping [6]
[36]. This study does not address the problem of sequential grouping. The system in [19] handles the
situation of nonspeech interference but is not applicable to mixtures of multiple speakers. Sequentially
grouping segments or masks could be achieved by using speech recognition in a top-down manner (also
limited to nonspeech interference) [2] or by speaker recognition using trained speaker models [32].
Nevertheless, these studies are not yet mature, and substantial effort is needed in the future to fully
address the problem of sequential grouping.

To  conclude,  we  have  proposed  an  algorithm  that  estimates  target  pitch  and  segregates  voiced  target  in 
tandem.  This  algorithm  iteratively  improves  the  estimation  of  both  target  pitch  and  voiced  target.  The 
tandem  algorithm  is  robust  to  interference  and  produces  good  estimates  of  both  pitch  and  voiced  speech 
even  in  the  presence  of  strong  interference.  Systematic  evaluation  shows  that  the  tandem  algorithm 
performs  significantly  better  than  previous  CASA  systems.  Together  with  our  previous  system  for 
unvoiced  speech  segregation  [19],  we  have  a  complete  CASA  system  to  segregate  speech  from  various 
types  of  nonspeech  interference. 